ML Course Part 1 - Introduction to Machine Learning

Alexandre Bry

Introduction

Definition

Machine Learning (ML)

Feeding data into a computer algorithm so that it learns patterns and can make predictions in new, unseen situations.

ML Model

A computer object implementing an ML algorithm, trained on a set of data to perform a given task.

Definitions

Categories of ML

Type of dataset

  • Supervised: for each input in the dataset, the expected output is also part of the dataset
  • Unsupervised: for each input in the dataset, the expected output is not part of the dataset
  • Semi-supervised: only a portion of the inputs of the dataset have their expected output in the dataset
  • Reinforcement: there is no predefined dataset, but an environment giving feedback to the model when it takes actions

Type of output

  • Classification: assigning one (or multiple) label(s), chosen from a given list of classes, to each input
  • Regression: assigning one (or multiple) value(s), chosen from a continuous set of values, to each input
  • Clustering: creating categories by grouping similar inputs together

Dataset

Dataset

Dataset

A collection of data used to train, validate and test ML models.

Content

Instance (or sample)

An instance is one individual entry of the dataset.

Feature (or attribute or variable)

A feature is a type of information stored in the dataset about each instance.

Label (or target or output or class)

A label is a piece of information that the model must learn to predict.

Subsets

Dataset subsets

An ML dataset is usually subdivided into three disjoint subsets, each with a distinct role in the training process:

  • Training set: used during training to train the model,
  • Validation set: used during training to assess the generalization capability of the model, tune hyperparameters and prevent overfitting,
  • Test set: used after training to evaluate the performance of the model on new data it has not encountered before.

Overview of ML methods

Supervised Learning

Linear Regression

Used for predicting continuous values:

  • Simple: \(y = \alpha + \beta x\)
  • Multiple: \(y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n\)
  • Polynomial: \(y = \alpha + \beta_1 x + \beta_2 x^2 + \cdots + \beta_n x^n\)
  • and many others…
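As a minimal sketch (not part of the course material), a simple linear regression \(y = \alpha + \beta x\) can be fitted by least squares with NumPy; the data below is synthetic, generated from known coefficients:

```python
import numpy as np

# Hypothetical 1D data: y is roughly 1 + 2x plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=x.size)

# Least-squares fit of y = alpha + beta * x
# np.polyfit returns coefficients from highest degree down: [beta, alpha]
beta, alpha = np.polyfit(x, y, deg=1)
print(f"alpha ≈ {alpha:.2f}, beta ≈ {beta:.2f}")
```

The recovered coefficients should be close to the true values 1 and 2 used to generate the data.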

Simple Linear Regression1

Logistic Regression

Used for binary classification problems:

  • Binomial: only two possible categories
  • Multinomial: three or more possible categories
  • Ordinal: three or more possible categories which are ordered
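A hedged sketch of the binomial case using scikit-learn (the toy data and threshold are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary problem: class is 1 exactly when the single feature exceeds 0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] > 0).astype(int)

# The model learns a probability-of-class-1 curve over the feature
clf = LogisticRegression().fit(X, y)
print(clf.predict([[-2.0], [2.0]]))
```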

Binomial Logistic Regression2

Decision Trees

A tree-like structure used for both classification and regression

Decision Tree3

Random Forests

An ensemble method that combines multiple decision trees:

  • Train independently \(B\) trees using:
    • Bagging: each tree is fitted on a random subset of the training set
    • Feature bagging: each split in the decision tree (i.e. each node) is chosen among a subset of the features
  • Take a decision by aggregating individual decisions of each tree
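The two bagging mechanisms above map directly onto scikit-learn parameters; a minimal sketch on synthetic data (dataset and parameter values are illustrative, not from the course):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# B=50 trees: bootstrap=True fits each tree on a random subset (bagging),
# max_features="sqrt" restricts each split to a feature subset (feature bagging)
forest = RandomForestClassifier(n_estimators=50, max_features="sqrt",
                                bootstrap=True, random_state=0).fit(X, y)

# The forest aggregates the individual tree votes into one decision
print(forest.score(X, y))
```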

Boosting

An ensemble method that combines weak learners (usually decision trees) to form a stronger model:

  • Choose a simple base learner (e.g. small decision trees with a fixed number of leaves)
  • Repeatedly:
    • Train a new base learner on the weighted training set
    • Add this new learner to the ensemble
    • Give more weight in the training set to misclassified data

\(k\)-Nearest Neighbors (KNN)

A non-parametric method for classification and regression
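A minimal classification sketch (dataset choice is illustrative): each point is labeled by majority vote among its \(k\) nearest neighbors in feature space.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# k=5: classify by majority vote among the 5 nearest training points
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.score(X, y))
```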

k-NN Classification4

Naive Bayes

A probabilistic classifier based on Bayes’ theorem
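A hedged sketch of the Gaussian variant (which assumes features are conditionally independent and normally distributed within each class); the dataset choice is illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Applies Bayes' theorem with a per-class Gaussian model of each feature
nb = GaussianNB().fit(X, y)
print(nb.score(X, y))
```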

Support Vector Machines (SVM)

Used for classification and regression, effective in high-dimensional spaces:

  • Separate the feature space using optimal hyperplanes
  • Features are mapped into a higher-dimensional space, enabling separations that are non-linear in the original feature space

SVM kernel trick: map features to a higher dimensional space to be able to separate classes using a hyperplane5

Unsupervised Learning

\(k\)-Means Clustering

A method for partitioning data into \(k\) clusters:

  • \(k\) must be chosen in advance
  • Clusters are built around iteratively refined center points
  • A classical method with many variants
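A minimal sketch on two synthetic, well-separated blobs (the data and parameters are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs; k must be chosen in advance (k=2 here)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# Centers are refined iteratively until assignment stops changing
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.round(1))
```

The two learned centers should land near (0, 0) and (5, 5), the blob means.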

\(k\)-Means Clustering convergence process6

Hierarchical Clustering

Builds a hierarchy of clusters using either agglomerative or divisive methods:

  • Build a full hierarchy top-down (divisive) or bottom-up (agglomerative)
  • Create any number of clusters by cutting the tree
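A hedged sketch of the agglomerative (bottom-up) variant on synthetic blobs, cutting the hierarchy into two clusters:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two separated blobs; merge points bottom-up, then cut into 2 clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])

labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(labels[:3], labels[-3:])
```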

Hierarchical Clustering7

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

Clustering based on the density of data points:

  • Divides points into 4 categories: core points (in red below), directly reachable (yellow), reachable, and outliers (blue)
  • Only two parameters: the radius (\(\epsilon\)) and the number of neighbors required to be a core point (\(min_{pts}\))
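A minimal sketch with scikit-learn (data and parameter values are invented): a dense blob forms one cluster, and a far-away point is flagged as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# A dense blob plus one far-away point that should be flagged as noise
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), [[10.0, 10.0]]])

# eps is the radius, min_samples the neighbor count needed to be a core point
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(labels[-1])  # -1 marks the outlier
```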

DBSCAN Illustration8

DBSCAN Result9

Principal Component Analysis (PCA)

Dimensionality reduction technique to project data into lower dimensions:

  • Project data into a space of lower dimension
  • Keep as much variance (so as much information) as possible
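A hedged sketch: 2D data lying almost on a line is projected onto a single principal component, which retains nearly all of the variance (the data is synthetic):

```python
import numpy as np
from sklearn.decomposition import PCA

# 2D data strongly correlated along one direction
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, t + rng.normal(0, 0.1, size=200)])

# Project onto 1 principal component, keeping most of the variance
pca = PCA(n_components=1).fit(X)
print(pca.explained_variance_ratio_)
```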

PCA Illustration10

t-Distributed Stochastic Neighbor Embedding (t-SNE)

A nonlinear dimensionality reduction technique primarily used for visualization of high-dimensional data

t-SNE Result on MNIST dataset11

Reinforcement Learning

Definitions

Reinforcement Learning Framework12

Q-Learning

A value-based reinforcement learning algorithm:

  • Learn the expected reward of each action in each state
  • Limited to discrete, simple environments
  • Neural network variants (like Deep Q-Learning) can handle more complex environments

Q-Learning Table13

State-Action-Reward-State-Action (SARSA)

The On-Policy version of Q-Learning (which is Off-Policy). On-Policy and Off-Policy methods differ in which policy is used during training to take decisions:

  • On-Policy: the exact same policy that is currently being learnt
  • Off-Policy: another policy

Policy Gradient Methods

Optimize the policy by adjusting parameters in the direction that improves performance (e.g., REINFORCE algorithm)

Actor-Critic Methods

Combination of an Actor and a Critic learning simultaneously:

  • The Actor learns the policy through a parametrized function
  • The Critic estimates the value of each action and gives feedback to the Actor

Introduction to Neural Networks (NN)

Definition

Neural Network (NN)

Subtype of ML model inspired by the brain. Composed of several interconnected layers of nodes capable of processing and passing information.

Structure of a NN model

Basic Elements

Neuron

Takes multiple inputs, computes a weighted sum of them and passes the result as output.

Layer

Set of similar neurons taking different inputs and/or having different weights.

Neural Network (NN)

Sequence of layers.

Linear Functions

Linear Function

Function that can be written like this: \[ f(\alpha_1, \cdots, \alpha_n) = (\beta_{1,1} \alpha_1 + \cdots + \beta_{1,n} \alpha_n, \cdots, \beta_{m,1} \alpha_1 + \cdots + \beta_{m,n} \alpha_n) \]

Composition of Linear Functions

The composition of any number of linear functions is a linear function.
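This is why a stack of purely affine layers cannot be more expressive than a single one. A quick numerical check with NumPy: composing two linear maps is the same as the single linear map given by their matrix product.

```python
import numpy as np

# Two linear maps represented by matrices A (R^4 -> R^3) and B (R^3 -> R^2)
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(2, 3))
x = rng.normal(size=4)

# Applying A then B equals applying the single linear map B @ A
composed = B @ (A @ x)
single = (B @ A) @ x
print(np.allclose(composed, single))  # True
```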

Activation Functions - Definition

Activation Function

Function applied to the output of a NN layer (i.e. to the output of each of its neurons) to introduce non-linearity to the model.

Activation functions allow the network to approximate much more complex functions, using a sequence of interleaved affine layers and activation layers.
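The common activation functions pictured on the next slide are one-liners in NumPy; a minimal sketch:

```python
import numpy as np

def relu(x):
    # Zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but with a small slope alpha for negative inputs
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    # Squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
print(relu(x), leaky_relu(x), sigmoid(x), np.tanh(x))
```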

Activation Functions - Examples

Rectified linear unit (ReLU)14

Hyperbolic tangent (tanh)15

Logistic, sigmoid, or soft step16

Leaky rectified linear unit (Leaky ReLU)17

Affine layers

Fully Connected

The most basic layer, in which each output is a linear combination of each input (before the activation layer)

Fully Connected Layer18

Fully Connected Layer19

Convolutional

A layer combining spatially close features, widely used to process raster data such as images.

2D Convolutional Layer20

2D Convolutional Layer21

Recurrent

A type of layer designed to process sequential data such as text, time series, speech or audio. It works by combining the input data with the state of the previous time step.

The two main variants of recurrent layers are:

Long Short-Term Memory (LSTM)22

Gated Recurrent Unit (GRU)23

Nowadays, transformer architectures are however preferred to process sequential data.

Pooling

A type of layer used to reduce the number of features by merging multiple features into one. There are several kinds of pooling layers, the simplest being Maximum Pooling and Average Pooling.

Max Pooling Example24

Residual

A Residual Block aims at stabilizing training and convergence of deep neural networks (with a large number of layers), by adding the input of a given layer to the output of another layer further down in the architecture.

Residual Block skipping two layers25

Attention

Attention aims at determining relative importance of each part of the input to make better predictions. It is used a lot in natural language processing (NLP) and image processing.

Attention mechanism in seq2seq with RNN26

And a lot more…

  • Dropout: randomly drop out some of the nodes during training to reduce overfitting
  • Batch Normalization: normalize the input of each layer across the batch to improve training stability and speed
  • Layer Normalization: normalize the input of each layer across the features to improve training stability and speed
  • Embedding: transforms discrete input data into continuous vectors in a lower-dimensional space
  • Flatten: convert multi-dimensional data into 1D data that can be fed into fully connected layers

NN training

Loss Function - Definition

A loss function is a mathematical function that quantifies the difference between the network’s predicted output and the actual target values. The goal during training is to minimize this loss by adjusting the model’s weights, using gradient descent.

The most common loss functions are:

  • Mean Squared Error (MSE) for regression
  • Cross-entropy Loss for classification
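Both losses are short NumPy expressions; a minimal sketch (the helper names are mine, not a library API):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of squared prediction errors
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, p_pred, eps=1e-12):
    # y_true: one-hot labels; p_pred: predicted class probabilities
    p = np.clip(p_pred, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(p), axis=1))

print(mse(np.array([1.0, 2.0]), np.array([1.5, 2.0])))  # 0.125
```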

Loss Function - Differentiable

To perform gradient descent, the loss function must be differentiable, meaning continuous (no jumps) and smooth (no sudden changes of direction).

Examples of differentiable functions[1]

Examples of non-differentiable functions[1]

Loss Function - Convex

To get the best results from gradient descent, it also helps if the function is convex. The simplest definition of convexity is that if you draw a straight line between two points on the curve, the curve lies below that segment.

Example of convex and non-convex functions[1]

Gradient Descent - Definition

Gradient Descent

The process of iteratively computing the gradient of a function and following its opposite direction to (hopefully) reach the minimum value of the function (if it exists).

Gradient Descent works because at any point in the function's domain, the gradient points in the direction of steepest ascent. So locally, stepping in the opposite direction is the quickest way to decrease the function's value. Coming back to the requirements listed before:

  • Differentiable functions are functions whose gradient exists everywhere
  • Convex functions are convenient for gradient descent because they have a single minimum, so slowly going down the function always leads to it.

Gradient Descent - Algorithm

Gradient Descent boils down to iteratively:

  1. Compute the gradient of the loss function at the current point
  2. Take a step in the direction opposite to the gradient, reaching a new point
  3. Repeat from step 1 until the stop condition is met

In this process, the three things that have to be defined are:

  • The starting point (weights initialization)
  • The size of the steps (learning rate)
  • The condition to stop
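The loop above fits in a few lines; a minimal sketch on a one-dimensional convex function (the function, starting point and learning rate are invented for illustration):

```python
# Minimize the convex function f(w) = (w - 3)^2, whose gradient is 2(w - 3)
def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0               # starting point (weight initialization)
lr = 0.1              # learning rate
for _ in range(100):  # fixed number of steps as the stop condition
    w -= lr * grad(w)  # step against the gradient direction

print(round(w, 4))  # converges towards the minimum at w = 3
```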

Gradient Descent - Weights initialization

The starting point is defined by the first output of the model, and therefore by the initial values of its weights. There are numerous methods to initialize the weights; the most common is to draw them randomly from a centered, normalized Gaussian distribution.

Gradient Descent - Learning rate

The gradient gives us a direction and a norm, but this norm is arbitrary and has to be rescaled using what we call the learning rate. The learning rate doesn't define the step size directly, but the scalar factor applied to the gradient's norm, which means that the norm itself still plays a crucial role.

The choice of the learning rate is crucial to hopefully converge quickly to the global minimum loss.

Example of gradient descent on the same function with different learning rates[1]

Gradient Descent - Stop condition

The stop condition determines when you decide to stop the algorithm. An easy solution is to choose a number of steps before launching the algorithm, but this will either imply useless computations after the algorithm has reached a final point, or stopping too early and not getting the best possible results.

Therefore, although there are more complex methods, the most common and simple process is to monitor the value of the loss, memorize the lowest value ever reached, and stop when there has been a given number of steps without any improvement to the best value. Then, we usually keep the model weights corresponding to this best value.
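This patience-based monitoring can be sketched as follows (the loss values are simulated, not from a real training run):

```python
# Simulated per-epoch validation losses: improve, then plateau/worsen
losses = [1.0, 0.8, 0.6, 0.55, 0.56, 0.57, 0.58, 0.59, 0.60]

patience = 3  # stop after 3 epochs without improvement on the best loss
best, best_epoch, wait = float("inf"), -1, 0
for epoch, loss in enumerate(losses):
    if loss < best:
        best, best_epoch, wait = loss, epoch, 0  # memorize best weights here
    else:
        wait += 1
        if wait >= patience:
            break  # no improvement for `patience` epochs: stop training

print(best_epoch, best)  # epoch 3, loss 0.55
```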

Gradient Descent - Unlucky examples

Examples with a saddle point[1]

Animated example with a saddle point[1]

Backpropagation

Gradient Descent is beautiful, but so far we only know in which direction (the gradient) the output of the model should move. To transmit this information to the weights of the model's layers, we use backpropagation.

Backpropagation

The process of computing the gradient with respect to the weights of each layer of the model and modifying them accordingly. The name comes from the process starting at the last layer of the model and propagating incrementally back to the first layer.
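A hand-worked sketch of the chain rule on a tiny invented network (one hidden unit, squared-error loss), to show how gradients flow backwards from the loss to each weight:

```python
# Tiny network: y = w2 * relu(w1 * x); backprop is just the chain rule
x, target = 2.0, 10.0
w1, w2 = 0.5, 1.0

h = max(0.0, w1 * x)       # forward pass: hidden activation (ReLU)
y = w2 * h                 # forward pass: output
loss = (y - target) ** 2   # squared-error loss

# Backward pass: local derivatives, multiplied from the loss backwards
dloss_dy = 2.0 * (y - target)
dy_dw2 = h
dy_dh = w2
dh_dw1 = x if w1 * x > 0 else 0.0  # ReLU passes the gradient only if active

grad_w2 = dloss_dy * dy_dw2
grad_w1 = dloss_dy * dy_dh * dh_dw1
print(grad_w1, grad_w2)  # -36.0 -18.0
```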

Architectures of NN

Feedforward Neural Networks (FNNs)

Convolutional Neural Networks (CNNs)

Recurrent Neural Networks (RNNs)

Generative Adversarial Networks (GANs)

Autoencoder Networks

Transformer Networks

Miscellaneous

Playground

Playground

YOLOv8 Model

Usual pipeline

Overview

  1. Data acquisition
  2. Data preprocessing
  3. Model selection
  4. Model evaluation
  5. Final model training

Data acquisition

Gather the data, potentially from multiple different sources. Choosing the right sources can also depend on the choices made in the next steps.

Data preprocessing

Different issues

Multiple sources of issues and steps to perform:

  1. Handle different formats
  2. Remove outliers (mostly for raw data)
  3. (Optionally) extract features
  4. Handle missing data
  5. Normalize

Why normalization?

Idea

A priori all features have the same importance, so none of them should have an advantage. Therefore, having features with larger values than others would be detrimental.

Usually, each feature is individually normalized over the whole dataset (of \(n\) instances), to obtain a distribution with a mean of 0 and a standard deviation of 1:

\[ \begin{align*} \hat{X} & = \frac{1}{n} \sum\limits_{j=1}^n X_j \\ \sigma_X & = \sqrt{\frac{1}{n} \sum\limits_{j=1}^n (X_j - \hat{X})^2} \\ \forall k \in \{1, \cdots, n\}, X_k & \leftarrow \frac{X_k - \hat{X}}{\sigma_X} \end{align*} \]
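In NumPy this standardization is a one-liner per feature column; a minimal sketch:

```python
import numpy as np

# One feature column; standardize to mean 0 and standard deviation 1
X = np.array([2.0, 4.0, 6.0, 8.0])
X_norm = (X - X.mean()) / X.std()

print(X_norm.mean(), X_norm.std())  # ≈ 0.0 and 1.0
```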

Model selection

  • Type of model (ML, NN, DL, …)
  • Complexity:
    • Number of features
    • Type of output
    • Size of the layers (for NN)
    • Number of layers (for NN)
  • Hyperparameters

Model training

  • Loss selection: depends on the task, the objectives, the specific issues to solve
  • Training process selection (lots of different tweaks and improvements can be implemented in NN training)
  • Hyperparameter tuning, by repeatedly:
    • Selecting one or multiple configurations of hyperparameters
    • Training the model one or multiple times
    • Determining the best hyperparameters

Model evaluation - Criteria

Criteria selection among the many possible ones:

  • For classification:
    • Accuracy: for balanced datasets
    • Precision: when false positives are costly
    • Recall: when false negatives are costly
    • F1-Score: when class distribution is unbalanced
  • For regression:
    • Mean Absolute Error (MAE)
    • Mean Square Error (MSE): more sensitive to large errors than MAE

Model evaluation - Cross-validation

Cross-validation

Method to estimate the real performance of the model by:

  1. Splitting the dataset into multiple parts (usually 5)
  2. Training and evaluating the model on different combinations of these parts (usually 5)
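A hedged sketch with scikit-learn's `cross_val_score`, which performs the split/train/evaluate rounds automatically (the model and dataset are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: 5 train/evaluate rounds on rotating splits
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(len(scores), scores.mean().round(2))
```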

Cross-validation27

Final model training

Once the data is preprocessed, the model selected, and the hyperparameters chosen and optimized, the final model can be trained multiple times to keep the best one.

Challenges

Data

  • Quality
  • Diversity
  • Biases and fairness (example of crime prediction)

Biases and fairness

Model architecture, training and evaluation also play a major role in preventing biases, and can sometimes make it possible to build unbiased models from biased data.

Underfitting and Overfitting

Definitions

Underfitting

When a model is too simple to properly extract information for a complex task. Can also be caused by key information missing from the input features.

Overfitting

When a model is too complex to properly generalize to new data. Happens often when a NN is trained too long on a dataset that is not diverse enough and learns the noise in the data.

Illustrations

Underfitting and overfitting on a regression task[2]

Underfitting and overfitting on a classification task[2]

Solutions

Solution Underfitting Overfitting
Complexity Increase Reduce
Number of features Increase Reduce
Regularization Reduce Increase
Training time Increase Reduce

General strategies:

  • Cross-validation to identify problems
  • Grid search/random search to tune hyperparameters and balance between underfitting and overfitting
  • Ensemble methods to reduce overfitting by using many smaller models instead of one big one

Interpretable and Explainable

Definitions

Interpretable

Qualifies an ML model whose decision-making process is straightforward and transparent, making it directly understandable by humans. This requires restricting the model's complexity.

Explainable

Qualifies an ML model whose decision-making process can be partly interpreted afterwards using post hoc interpretation techniques. These techniques are often used on models that are too complex to be interpreted directly.

Python libraries

Data manipulation

Libraries

  • NumPy
    • Fast numerical operations
    • Matrices with any number of dimensions (called arrays)
    • Lots of convenient operators on arrays
  • Pandas
    • Can store any type of data
    • 1D or 2D tables (called DataFrames)
    • Lots of convenient operators on DataFrames

NumPy

import numpy as np
array = np.array([[0, 1], [2, 3]])
print(array)
[[0 1]
 [2 3]]
print(5 * array)
[[ 0  5]
 [10 15]]
print(np.power(array, 3))
[[ 0  1]
 [ 8 27]]
print(array @ array)
[[ 2  3]
 [ 6 11]]
print(np.where(array < 2, 10 - array, array))
[[10  9]
 [ 2  3]]

Pandas

import pandas as pd
import numpy.random as npr
df = pd.DataFrame([
    ["Pi", 3.14159, npr.randint(-100, 101, (2, 2))],
    ["Euler's number", 2.71828, npr.randint(-100, 101, (2, 2))],
    ["Golden ratio", 1.61803, npr.randint(-100, 101, (2, 2))]
  ], columns = ["Names", "Values", "Random numbers because why not"])
print(df)
            Names   Values Random numbers because why not
0              Pi  3.14159         [[-34, 88], [-80, -5]]
1  Euler's number  2.71828         [[21, 37], [-62, -85]]
2    Golden ratio  1.61803        [[88, -43], [-93, -94]]
print(df[df["Values"] > 2])
            Names   Values Random numbers because why not
0              Pi  3.14159         [[-34, 88], [-80, -5]]
1  Euler's number  2.71828         [[21, 37], [-62, -85]]
print(df[df["Names"].str.contains("n")])
            Names   Values Random numbers because why not
1  Euler's number  2.71828         [[21, 37], [-62, -85]]
2    Golden ratio  1.61803        [[88, -43], [-93, -94]]

ML

SciPy

Scientific and technical computing based on NumPy. Documentation here

import scipy as sp
import numpy as np
# Define a linear system Ax = b
A = np.array([[3, 2], [1, 4]])
b = np.array([6, 8])

# Solve the system
x = sp.linalg.solve(A, b)
print("Solution to the linear system Ax = b:", x)
Solution to the linear system Ax = b: [0.8 1.8]
# Define a function to integrate
def integrand(x):
    return np.exp(-x**2/2)/np.sqrt(2*np.pi)

# Integrate the function from 0 to infinity
result, error = sp.integrate.quad(integrand, 0, np.inf)
print("Integral of exp(-x**2/2)/sqrt(2*np.pi) from 0 to infinity:", result)
Integral of exp(-x**2/2)/sqrt(2*np.pi) from 0 to infinity: 0.4999999999999999

scikit-learn

Documentation here

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression, load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Generate data, create and fit a Linear Regression model
X, y = make_regression(n_samples=100, n_features=1, noise=0.1)
model = LinearRegression()
model.fit(X, y)

print("Coefficient:", model.coef_, "Intercept:", model.intercept_)
Coefficient: [80.48982499] Intercept: -0.007374778452941655
# Load the iris dataset and split it in training and test
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit a random forest classifier on training data, and use it
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
Accuracy: 1.0

NN

PyTorch

TensorFlow

Keras

Visualization

Matplotlib

Plotly

Seaborn

Resources

Machine Learning

Introduction to Machine Learning

[1] Robert Kwiatkowski. “Gradient descent algorithm: A deep dive.” Available at: https://towardsdatascience.com/gradient-descent-algorithm-a-deep-dive-cf04e8115f21.
[2] GeeksforGeeks. “Underfitting and overfitting in machine learning.” Available at: https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/.

Footnotes

  1. By Anscombe.svg: Schutz(label using subscripts): Avenue - Anscombe.svg, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=9838454

  2. By Canley - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=116449187

  3. Black Dragon https://stackoverflow.com/questions/31786347/how-to-make-python-decision-tree-more-understandable

  4. By Antti Ajanki AnAj - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=2170282

  5. By Original: Alisneaky Vector: Zirguezi - Own work based on: Kernel Machine.png, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=47868867

  6. By Chire - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=59409335

  7. By [[:File:Hierarchical_clustering_diagram.png#file|]]: Stathis Sideris on 10/02/2005derivative work: Mhbrugman (talk) - [[:File:Hierarchical_clustering_diagram.png#file|]], CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=7344806

  8. By Chire - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=17045963

  9. By Chire - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=17085332

  10. By Nicoguaro - Own work, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=46871195

  11. By Kyle McDonald - https://www.flickr.com/photos/kylemcdonald/26620503329/, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=115726949

  12. By Megajuice - Own work, CC0, https://commons.wikimedia.org/w/index.php?curid=57895741

  13. By LearnDataSci - Own work - Article, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=69947708

  14. By Laughsinthestocks - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=44920600

  15. By Laughsinthestocks - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=44920568

  16. By Laughsinthestocks - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=44920533

  17. By Laughsinthestocks - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=46839644

  18. Diego Unzueta https://builtin.com/machine-learning/fully-connected-layer

  19. Diego Unzueta https://builtin.com/machine-learning/fully-connected-layer

  20. Diego Unzueta https://builtin.com/machine-learning/fully-connected-layer

  21. Diego Unzueta https://builtin.com/machine-learning/fully-connected-layer

  22. By fdeloche - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=60149410

  23. By fdeloche - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=60466441

  24. By Daniel Voigt Godoy - https://github.com/dvgodoy/dl-visuals/, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=150823502

  25. By LunarLullaby - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=131458370

  26. By Google - https://github.com/google/seq2seq, Apache License 2.0, https://commons.wikimedia.org/w/index.php?curid=150802792

  27. By Gufosowa - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=82298768